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Abstract 

We propose an exact simulation algorithm for lattice QCD with dynamical Kogut- 
Susskind fermion in which the iVy-flavor fermion operator is defined as the Nf /4-th 
root of the Kogut-Susskind (KS) fermion operator. The algorithm is an extension 
of the Polynomial Hybrid Monte Carlo (PHMC) algorithm to KS fermions. The 
fractional power of the KS fermion operator is approximated with a Hermitian 
Chebyshev polynomial, with which we can construct an algorithm for any number of 
flavors. The error which arises from the approximation is corrected by the Kennedy- 
Kuti noisy Metropolis test. Numerical simulations are performed for the two-flavor 
case for several lattice parameters in order to confirm the validity and the practical 
feasibility of the algorithm. In particular tests on a 16 4 lattice with a quark mass 
corresponding to nips/my ~ 0.68 are successfully accomplished. We conclude that 
our algorithm provides an attractive exact method for dynamical QCD simulations 
with KS fermions. 
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1 Introduction 



In numerical studies of lattice QCD, advancing simulations including dynam- 
ical quarks is the most pressing issue to confirm the validity of QCD and 
to extract low energy hadronic properties from it. While including dynami- 
cal quark effects is still a difficult task, recent developments of computational 
power and algorithms have enabled dynamical simulations of reasonable scale. 
Much efforts are thus beingspent to accurately compute physical quantities 
in full QCD simulations [IBB, H, 0, 0- 

A large number of these simulations are being made with Wilson-type fermion 
action using the Hybrid Monte Carlo (HMC) algorithm 0,0], which is one of 
the exact dynamical fermion algorithms. A limitation of the HMC algorithm is 
that the number of flavors has to be even to express the fermion determinant 
in terms of pseudo-fermion fields. Recently, however, the polynomial Hybrid 
Monte Carlo (PHMC) algorithm has been proposed to simulate the 
odd number of flavors with Wilson-type fermion as an exact algorithm 2, 3 . 
The combination of the HMC and PHMC algorithms enables us to simulate 
the realistic world of 2+1-flavors of quarks with lattice QCD [l2l. Ilif. 

The Kogut-Susskind (KS) fermion action has also been widely used in full 
QCD simulations. For the KS action, the application of the HMC algorithm 
is restricted to a multiple of four flavors due to the four-fold Dirac fermion 
content of a single KS fermion. Even if one adopts the usual assumption that 
the 1/4 power of the KS fermion determinant provides a lattice discretization 
of a single Dirac fermion determinant, efficient exact algorithms have not been 
known for two- or single-flavor quark with the KS formalism. For this reason, 
dynamical KS fermion simulations for two- or three-flavor QCD are made with 



the /^-algorithm [14j even today, which is an approximate algorithm. 



The i?-algorithm involves a systematic error from a finite step size of the 
molecular dynamics integrator. Strictly speaking, a careful extrapolation of 
physical quantities to the zero step size is required. Since this is too computer 
time consuming, numerical simulations are usually carried out at a finite step 
size which is chosen so that the systematic error is considered invisible com- 
pared to the statistical error. While checks on small lattices are usually made 
to ensure smallness of the systematic error at least for several quantities, the 
possibility that other physical quantities are spoiled by the finite step size 
effects is difficult to exclude. Even such checks become progressively more 
difficult as smaller quark masses require vastly increasingly computer time. 
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Clearly, an exact and efficient algorithm is desired for dynamical QCD sim- 
ulations with the KS fermion action not only for two flavors but also for a 
single flavor of quarks. 



In this article, we propose an exact algorithm for KS fermions which is capable 
of simulating an arbitrary number of flavors. Our algorithm is an application 
of the PHMC algorithm. It is an extension of the idea of the Rational Hybrid 
Monte Carlo (RHMC) algorithm Q put forward by Horvath, Kennedy, and 
Sint a few years ago. They briefly described their idea and tested their algo- 
rithm together with the PHMC algorithm for two- and four-flavor QCD with 
the KS fermions on a small size lattices. We shall comment on the difference 
between their algorithm and ours below. Recently Hasenfratz and Knechtli [r| 
also proposed an exact algorithm for KS fermions with hyper-cubic smeared 
links, which makes use of the polynomial approximation and global update 
algorithm. The algorithm is considered to be effective for the action for which 
the HMC type algorithm cannot be applied. 



The outline of our algorithms goes as follows. Applying a polynomial approx- 
imation to the fractional power of the KS fermion matrix, we rewrite the 
original partition function to a form suitable for the PHMC algorithm. The 
resulting effective partition function has two parts; one part is described by an 
effective action for the polynomial approximation of the fermion action, which 
can be treated by the HMC algorithm. The other part is the correction term 
which removes the systematic errors from the polynomial approximation. The 
correction term can be evaluated by the Kennedy-Kuti noisy Metropo- 
lis test, which has been successfully used in the multi-boson algorithm lH . 



With this combination, we can make an exact algorithm for KS fermions. In 
this work we describe the details of the algorithm, and report results of our 
numerical test on the applicability of the algorithm to realistic simulations. 



Since the polynomial approximation to rewrite the partition function is not 
unique, there can be several realizations of the PHMC algorithm for the KS 
fermion. We construct two types of realizations of the polynomial approxima- 
tion to the PHMC algorithm with Nf quark flavors: 



Case A Use a polynomial which approximates the Nf/8 power of the KS 
fermion matrix which corresponds to Nf/2 quark flavors. Introducing a 
single pseudo-fermion field with squaring the fermion matrix, we obtain Nf 
flavors of quarks. 

Case B Use an even-order polynomial which approximates the Nf/4 power 
of the KS fermion matrix, which corresponds to Nf flavors of quarks. To 
express Nf quark flavor with a single pseudo-fermion field, the even-order 
polynomial is factored into a product of two polynomials using the method 
by Alexandrou et al. [19]. 
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We estimate the computational cost of the two algorithms in terms of the num- 
ber of multiplication by the KS fermion matrix, and find that the algorithm 
B is roughly a factor two more efficient. 

We investigate the property and efficiency of the Chebyshev polynomial ap- 
proximation as an example of the polynomial approximation. The method to 
split the even-order polynomial into two polynomials, which is used in case 
B, is also described. The limitation of our method for splitting the even-order 
polynomial is discussed. 

The Kennedy-Kuti noisy Metropolis test involves a non-trivial factor. Since we 
take the fractional power of the KS operator, the correction term also includes 
fractional powers of fermion matrices. In order to evaluate the measure for 
the noisy Metropolis test acceptance rate, we need to evaluate the fractional 
power of the correction matrix. To do this, we make use of the Lanczos-based 
Krylov subspace method by Borigi j^], originally proposed for the inverses 
square root of the squared Hermitian Wilson-Dirac operator in the Neuberger's 
overlap fermion. We modify the Borici's algorithm to our purpose. 

Finally we investigate the validity and property of the PHMC algorithm (case 
B) in the case of two flavors of quarks numerically. We first confirm that our 
algorithm works correctly using a small lattice of a size 8 3 x 4, where a com- 
parison to the i?-algorithm is performed. On an 8 3 x 16 lattice we investigate 
the mass dependence of the algorithm. We find that for quark masses lighter 
than mps/my ~ 0.60 the polynomial with order larger than 0(600) is re- 
quired. Finally, we apply our algorithm to a moderately large lattice of 16 4 
with a rather heavy quark mass mps/my ~ 0.69 as a prototype for future re- 
alistic simulations. On this lattice violation of reversibility and convergence of 
the Lanczos-based Krylov subspace method for the noisy Metropolis test are 
investigated. We find satisfactory results; there is no visible reversibility vio- 
lation, and the Krylov subspace method converges within the limit of double 
precision arithmetic. The test run on the large size lattice shows reasonable 
efficiency on the computational time. 

The rest of the paper is organized as follows. In section 2, we introduce the 
lattice QCD partition function with the KS fermions, and rewrite it to a form 
suitable for the PHMC algorithm. We give two forms for the partition func- 
tion as described above. The outline of the PHMC algorithm is also shown 
in this section. In section 3, we describe the Chebyshev polynomial as a spe- 
cific choice for the polynomial approximation. The error of the polynomial 
approximation is investigated. The molecular dynamics (MD) step with the 
polynomial approximation in the PHMC algorithm is explained in section 4. 
The details of the noisy Metropolis test is given in section 5. We estimate the 
computational cost of the algorithms in section 6, where we find that the case 
B algorithm is better. In section 7, we show the test results with the PHMC 
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algorithm. Conclusions are given in the last section. 



2 Effective Action for PHMC algorithm 



The QCD partition function for Nf flavors of quarks using the KS fermion is 
defined by 



Z = fvUd*jt[D] N ''*e- 8 'M, 



(2-1) 



where S g [U] is the gauge action, U^in) is gauge links, and det[D] is the deter- 
minant of the KS fermion operator D. The KS fermion operator D is defined 
by 



D(n, m) = am5 nim + Vt^( n ) [u^(n)5 n+i x, m - E/J(n - p,)S n -fi, m , ] ■ (2.2) 



where am is the bare quark mass with lattice unit a, and rj^n) is the usual 
KS fermion phase. In our work we adopt the usual assumption that taking the 
fourth root of the KS fermion operator represents a lattice discretization of a 
single Dirac fermion operator in the continuum. 



The determinant of the KS fermion operator can be rewritten as 



det[D] = det 



det 


aml ee 




= det 


l ee 




M oe 


aml 00 




Doo 



(2.3) 



where D OQ = (am) 2 l 00 — M oe M eo with M eo (M oe ) the hopping matrix from odd 
site to even site (even site to odd site) defined in Eq. (2.2). Since D OQ is nothing 
but the odd part of D^D, the eigenvalues are real and positive semi-definite, 
which enable us to take the fractional power of the KS fermion operator except 
for vanishing quark masses. Thus the QCD partition function is reduced to 



VUdet[D 



]Nf/4 e -S B [U] m 



(2.4) 



To apply the PHMC algorithm, we approximate the fractional power of D OQ by 
a polynomial of D oc/ . We consider two methods for the polynomial approxima- 
tion. We restrict ourselves to the case that the number of flavors is one or two. 
Generalization to any integer flavors is achieved by combining the single-flavor 
and two-flavor cases. 
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Case A We introduce a polynomial Pjv o1 [a;] of order N po \ y , which approxi- 
mates x~ Nf / 8 for real and positive (non-zero) x as 



jv, 



X 



-N f /8 



p°i y 

P N poly [x] = C i X \ 
i=0 



(2.5) 



where q's are real coefficients. The property that the eigenvalues of the KS 
fermion operator D QO are bounded below by {am) 2 enables us to substitute D 00 
into Eq. (2.5) to approximate the fractional power of the KS fermion operator. 
Using this polynomial, we can rewrite the determinant as 



det 



Dr. 



Nf/4 



det 


D 00 (P Npoly [D 00 ]) 


S/N f - 


det 


{ P N poly [D 00 ] 


^/Nf 





det 



w[b 



N f /4 



det 



^poly [Doo] 



2 ■ 



where we introduced 



W[D 00 ] = D 00 (P Npoly [D c 



8/N f 



N f /4 



(2.6) 



(2.7) 



whose deviation from the identity matrix indicates the residual of the polyno- 
mial approximation. We refer to W[P 00 ] as the correction matrix. Note that 
the exponent of -P/v poly becomes an integer when Nf = 2 or Nf = 1 as we 
assumed. Introducing pseudo-fermion fields to the denominator of Eq. (2.6), 
we obtain 



det 



D n 



N f /4 



det 



W[D 



Nf/4 



J V<f>]p<f> e 



Sq[U,4> ,<t>o] 



\ ol y[ D Oo]<Po 



Thus the QCD partition function Eq. (2.4) becomes 



(2.8) 
(2.9) 

(2.10) 



Case B We define a polynomial P^ poly [x] with Af po i y even to approximate 
x -N f /4 f or rea j anc j p OS itive (non-zero) x by 



jv, 



poly 



X 



X 



(2.11) 



i=0 
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Similarly to the case A, we can rewrite the determinant as 



det 



Nf/4 


( det 


D 00 (P Npoly [D 00 )) 4/Nf 




det 


(p Npoly [D 00 ]) 4/Nf 





N f /4 



det 


W[D 00 \ 


Nf/4 


det 


P N poly [Doo] 



(2.12) 



where 



W[D 00 ] = D 00 (P Npoly [D 



4/N f 



(2.13) 



For A/p i y even, P Npoly [x] can be factored into two polynomials as employed 
in the multi-boson algorithm for single- flavor QCD |l9j |. Making use of this 
property, we obtain 



det 



D r . 



Hl det 



W[D 



N f /4 



det 



QN poly [D c 

where Q jv poly [x] is defined by 



2 > 



(2.14) 



ATpoiy/2 



QN poly [x] =PjV poly M, QiV poly N= d i X ^ 



(2.15) 



i=0 



with complex coefficients d{. The factoring is not unique. In the next section we 
describe the method for dividing the polynomial. Introducing pseudo-fermion 
fields to the denominator of Eq. (2.14), we obtain the QCD partition function 
for the PHMC algorithm in the case B as 



Z= \ VUV(j))P(j) &et[W[D 00 \v ' < 



Nf/4 -S g [U]-S q {U,4>l4>o] 



i ~ 2 

Sq[U, 4>l, 4> ] = Q Npoly [Doo]4>o ■ 

The PHMC algorithm follows two steps: 



(2.16) 
(2.17) 



(1) The HMC step with the effective action Eq. (2.10) for the case A (Eq. (2.16) 
for the case B). 

(2) The noisy Metropolis test to remove the systematic error represented by 
W[D 00 ] in Eq. (2.6) for the case A (Eq. (2.14) for the case B). 
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The n oisy Metropolis test in step (2) has been used in the multi-boson algo- 
rithm ^8||. The reweighting technique |TlJ and stochastic noisy estimator 12l | 



method can be applied to remove the systematic error. In this paper we employ 
the noisy Metropolis test to make our algorithm exact 



The original idea by Horvath et al. differs from ours. We separate the action 
into two pieces; the effective action with polynomial approximation and the 
correction factor. They do not separate the full action and apply the Kennedy- 
Kuti noisy estimator for the energy conservation violation dH itself to make 
the algorithm exact, while we apply the method to the correction factor. It is 
not known which approach is more efficient. We leave a study of this issue to 
future work as a comparison of the two algorithms is beyond the scope of the 
present paper. 



3 Polynomial approximation 



We employ the Chebyshev polynomial approximation to x~ s with a real pos- 
itive x, where s takes the following values: 1/2, 1/4, 1/8, depending on the 
choice of case A or B, and on Nf. We explain the application to the KS 
fermion operator and investigate the relation among the order of polynomial, 
the residual of the approximation, and the choice of case A or B. 

Several polynomial approximations for 1/x have been proposed in studies of 



the multi-boson algorithm. They include Chebyshev 21], adapted [22j, and 
least square |23| polynomials. The choice of the polynomial affects the effi- 
ciency, i.e., how much one can decrease the polynomial order so as to make 
the correction matrix as close to the unit matrix as possible. Since this is not 
a problem of the simulation algorithm itself, we employ the simple choice of 
the Chebyshev polynomial approximation for the fractional power of the KS 
fermion operator. 



3.1 Chebyshev approximation 

The Chebyshev polynomial expansion of x~ s is 

oo 

x - = [1 + (1 - e)y]- s = CkT k [y), (3.1) 

k=0 



where y = (x — 1)/(1 — e), e is optimized so as to satisfy y G [—1,1] depending 
on the support of x and e G (0, 1). We restrict the support of exponent s to 
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s G (0, 1]. 1 T k [x] is the k-th order Chebyshev polynomial denned by 

Tk[x] = cos(/c arccos(x)). (3.2) 



The polynomial has the following recurrence formula: 

T k [x] = a fc _ixT fe _![x] + / 9 fc _ 2 T fc _ 2 [x] {k > 1), (3.3) 



where a fc = 2/(1 + 4,o) (A: > 0), /3_i = 0, j3 k = -1 (A; > 0), and T [x] = 1. The 
Chebyshev polynomial expansion coefficients c k in Eq. (3.1) are calculated as 



^ (1 + r 2 ) s F(s, s + k-l + k- r 2 ) J{l+ k) - r , (3.4) 



i + <V 7 ' ' ' ' 7 r( s )r(i + A;)' 



where r = + y e(2 — e)J /(l — e), F(a, /3; 7; z) Gaussian hyper-geometric 
function, and T(z) Gamma function. Truncating Eq. (3.1) at order N po \ y , we 
approximate x~ s by 

-^poly 

x s = [l + (l_ e )y]- s ~ P Wpoly [x] = £ c^fo]. (3.5) 

fe=0 



The truncation error is bounded by 

x--P„ \x\ < 2 A + r 2 V (-r)Wi 



where r takes a value in the region — 1 < r < 0. This inequality is not opti- 
mal. It demonstrates, however, an exponential decrease of the residual with 
increasing iV po i y . When e <C 1, it is expected that iV po i y oc y/e at a constant 
residual. 

The operator corresponding to y in the above relation is given by shifting and 
changing the normalization of the KS fermion operator: 



1 The fractional power inverse of a matrix is also given by Gegenbauer polynomial 
expansion 
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where (aA) 2 = 2(aA max ) 2 /(4(am) 2 +2(aA max ) 2 ), and M 00 = l 00 +2M oe M eo /(aA max ) 2 . 
oA max is introduced to keep all eigenvalues of — M OQ in the domain [—1, 1]. Since 
the normalization in front of D DO in Eq. (3.7) can be absorbed into the nor- 
malization of the partition function, we use D' OQ instead of D 00 and omit the 
prime symbol from D' 00 in the rest of the paper. For the free case, aA max = 2 
is sufficient to satisfy the condition for the eigenvalues of — M 00 . In the inter- 
acting case, it becomes larger than two. This is seen, for example, in a study 
for SU(2) lattice gauge theory where the complete eigenvalue distribution has 
been investigated |25|]. With this expression for the KS fermion operator, the 
polynomial approximation of becomes 

D- S ~ PN poly [Doo] = £ c k T k [-M 00 ], (3.8) 

i=0 



where q is obtained from Eq. (3.4) with e = 1 — (aA) 2 . The exponent s is 
chosen to be s = Nf/8 for the case A and s = Nf/A for the case B. 

For the case B, we have to solve Eq. (2.15) to obtain the half-order polynomial 
<5AT poly . Here we choose to construct Qn po17 so as to have the form of the 
Chebyshev polynomial expansion as Eq. (3.8). Since this problem is rather 
complicated, we postpone the discussion to the subsection at the end of this 
section, and proceed assuming that the polynomial <5Ar poly and its coefficients 
di are already given. 

Given the Chebyshev polynomial expansion coefficients c m (d m ), we can eval- 
uate the polynomial Pn o1 N (QN po i y [ x ]) using the Clenshaw's recurrence for- 
mula (for example, see |2q). When x is the KS fermion operator D 00 , the 
multiplication of the operator -P/v poly [D 00 ] on a vector v is carried out by the 
following three step recurrence formula: 

y<*> = a k (-M 00 )y^ + (5 k y^ + c k v , (3.9) 



where y^ is a working vector labeled k, a k and (3 k are given in Eq. (3.3), and 
Cfc is the Chebyshev polynomial expansion coefficient. Solving for y^ with 
this equation from k = N po \ y to k = with the initial condition yi NpolyS> = 
yo = 0, we obtain 

PN poly [D 00 ]v = yi°\ (3.10) 



For Q Npoly [D 00 ]v , d k is used instead of c k and k runs from N po[y /2 to 0. Note 
that we need not store all working vectors y^; only two working vectors 
are required for the computation. This method can be applied to any ma- 
trix polynomial which has the same structure for the recurrence relation as 
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Fig. 1. R(x) as a function of y = (x - 1)/(1 - e) at 7V po i y /s = 800, e = 1/1000. 
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Fig. 2. iVpoiy/s dependence of the integrated residuals v R 2 - e = 1/1000 are plotted. 

Eq. (3.3). The computational cost to calculate P/v poly [/) 00 ]f (QN poly [D 00 ]v ) is 
^poiy (-^Vpoiy/2) by means of the number of matrix-vector multiplication. 

Now we discuss the quality of the polynomial approximation. We evaluate the 
polynomial approximation residual using 



R(x)= x(P Nvol \x]) 1/s ~ 1 



(3.11) 



Figure 1 shows the residual of the approximation as a function of y — (x — 
1)/(1 — e). The calculation is performed using Clenshaw's recurrence in dou- 
ble precision arithmetic. In order to compare the approximation at the fixed 
computational cost, the number of matrix- vector multiplication, in calculating 
the correction matrix irrespective of the choice of s (case A or B and Nf), we 
keep iVp iy/s constant (N po \ y /s = 800) as an example. We observe that the 
approximation becomes worse for smaller s (s = 1 case reaches the limit of 
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double precision arithmetic). We also investigate the N po \ y /s dependence of 
the polynomial approximation. In figure 2 we plot the N po \ y /s dependence of 
the residual defined by 



\ 



'dyRix^y. (3.12) 



Clear exponential decay is observed. The dependence of the slope on the choice 
of s indicates that the computational cost increases with decreasing s. In our 
case, defining the polynomial order by A^ ly and N^ oly for the case A and B, 
respectively it is expected that Np Qly /(N f /8) ~ 2iV ] £ ly /(iV / /4) holds when we 
require that the integrated residual takes a similar value for the two cases. We 
thus find N^ oly ~ JVg ly at the same value of the polynomial residual. 



3.2 Determination of QN poly in case B 

Here we discuss how to solve Eq. (2.15) to obtain the half-order polynomial 
Qn po1y - In our work Qn po17 should have the Chebyshev polynomial expansion 
form of Eq. (3.8). A simple procedure to obtain Qn po1y is as follows: (i) ex- 
press Eq. (3.5) as a polynomial of y instead of the expansion of the Cheby- 
shev polynomial, (ii) split the polynomial into the product of two polynomials 
like Eq. (2.15) by means of usual polynomial expansion, and (iii) recover the 
Chebyshev polynomial expansion for the half-order polynomial. However, this 
method has a numerical problem in step (iii). In order to make this point 
clear, and to present an alternative procedure, let us elaborate on the proce- 
dure above. 

In the step (i), we expand the Chebyshev polynomial to write 

E c kT k [y] = 

k=0 fc=0 



Pn po1y [x] = £ c k T k [y] = £ c' k y k , (3.13) 



where c' k can be obtained from the original c k using the appropriate recurrence 
formula (see ex. j^J). In the step (ii), finding the roots of the polynomial 
Y^ k =o y c 'kV k = 0; we obtain the product representation: 

-^poly -^poly 

E 4/ = 4 poly ll(v-Zk), (3-14) 

fc=0 fc=l 



where z^s are the roots of the polynomial. Because iVp i y is even and the 
coefficients c' fc 's are real, the root Zk pairs with its complex conjugate zy = z\. 
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Thus we can split the polynomial to the product of two polynomials [Tgj | : 



-?Vpoly ^poly/2 

E c kV k = 4 DOlv II (y~ z m)(y 



-^poly 

fc=0 fc=l 



1/2 Wpoly/2 

(4 poly ) " II (v~ z j 



(k), 



k=l 



(3.15) 



where j(k) is a reordering index to represent the pairing condition above. The 
explicit form of the reordering is not unique depending on how to distribute the 
complex pairs in the monomials, and several methods have been proposed [i^, 



In the third step (iii), we calculate the half-order polynomial according to 



1/2 Npoly/2 Npoly/2 

(4 po J II (y~ z m)= E d kv\ 

k=l k=0 

AWy/ 2 

= J2 d kTk[y], 

k=0 



(3.16) 



where d' k s are obtained from z k and c' N by expanding the product represen- 
tation, and d k s are extracted from d' k s using an appropriate reverse recurrence 



formula 26] (i.e. the relation opposite to Eq. (3.13)). In this way, we could 
derive d k s from c k s so as to satisfy 



iV, 



poly 



Pn po1y [x] = E c k T k [y] 

k=0 



Afpoly/2 

E d kT k [y] 

k=0 



(3.17) 



Unfortunately, we find that the reverse recurrence to extract the Cheby- 
shev polynomial expansion coefficients d k from the usual polynomial coeffi- 
cients d' k s is numerically unstable 26]. Therefore we decided to directly solve 
Eq. (3.17) with respect to d k . 

Using the addition relation, T k [y]Ti[y] = {T k+ i[y) + T\ k _ l \[y])/2, to Eq. (3.17), 
we extract the following second-order simultaneous equations, 



f k ({d}, {tf }) = ca, (k = 0,---,N poly ), 



(3.18) 



where f k ({d}, {d*}) depends on the sets {d} = {d\, di, ■ ■ ■ , ^7v poly /2} and {cf } = 
{d\, c?2, • • • , ^iv poly / 2 }- We do not write down the explicit form of f k {{d}, {d*}) 
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Fig. 3. Polynomial coefficients d; L for QN po i y [x] at e = 0.0001, iVpoly = 600, and 
Nf = 2 in complex plane. 




-1.00 -0.50 0.00 0.50 1.00 

y 



Fig. 4. Relative error, \P Npoly [x] - Q Npoly [x]Q* Npoiy [x]\/\P Npoly [x]\, plotted against y, 
where x = 1 + (1 — e)y. Parameters are the same as those in Fig. 3. 

here because of its length and complicated form. The solution to these equa- 
tions is not unique. This corresponds to the reordering ambiguity in the pre- 
vious case of splitting the product representation. 

We solve numerically the simultaneous equation Eq. (3.18) using Mathematica 
with desired accuracy starting from an initial choice of {d} and {d*}. The 
accuracy of the solution is examined by numerically evaluating the relation 
Eq. (3.17). Figure 3 shows the polynomial coefficients di for Qn po1y [x} derived 
by the direct method using Eq. (3.18). The accuracy of the solution stays at 
satisfactory level within double precision arithmetic as observed in Figure 4, 
where the polynomials are evaluated with Clenshaw's recurrence formula in 
double precision arithmetic. 

A practical limitation with our direct method is that it is rather slow. On a 
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Linux PC with 1 GHz Pentium III CPU, we could find the coefficients only for 
iVp i y < 800 within a tolerable computational time. A more efficient methods 
to solve Eq. (2.15) is desired. 



4 Calculation of the polynomial pseudo-fermion force in the Hy- 
brid Monte Carlo algorithm 

We apply the usual HMC algorithm to the partition function Eq. (2.10) (case 
A) or Eq. (2.16) (case B), introducing a fictitious time and canonical mo- 
menta Pjj,{n) to the link variables U^{n). A nontrivial task is the calculation 
of the molecular dynamics (MD) force from the quark action expressed by the 
polynomial approximation. We utilize the Clenshaw's recurrence formula for 
this purpose. Here we first discuss the variation of the polynomial of a matrix 
A for the general case, and then describe the force calculation in the HMC 
algorithm. 

Let P/v[A] be a matrix polynomial of A with order N described by 

N 

P N [A]=Y,CiHA], (4.1) 

i=0 

where [A] is defined to have the following recurrence relation: 

$i[A] = ai-iAQi^A] + A_ 2 $;_ 2 L4]. (4.2) 

In most cases $o[A] is a constant and set to be unity. As in Eq. (3.9), Pn [A] 
is evaluated by the Clenshaw's recurrence formula: 

= a k A + /3 fe y (fe+2) + c fc l, (4.3) 

where yW's are working matrices, Y^ N+1 ^ = y( JV ) — 0, k runs from N to 0, 
and -PjvfA] = Y( \ We take the variation of Eq. (4.1) with respect to A, and 
denote it by (5P/vL4]: 

N 

6P N [A] = Y,6$ i [A]c i . (4.4) 

i=0 

The variation 5<E>jL4] also has the recurrence formula obtained by differentiat- 
ing Eq. (4.2): 

8®i[A] = ai-rf&i-^A + A_ 2 <^_ 2 L4] + [A\5A. (4.5) 
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Substituting Eq. (4.3) and Eq. (4.5) into Eq. (4.4), we have 



N 



6P N [A] = Y,<Xi-i*i-i[A]8A 



(4.6) 



i=i 



where <5$o[^4] — and a similar technique to the Clenshaw's formula is used. 

Applying Eq. (4.6) to our problem, we can evaluate the variation of the pseudo- 
fermion action Eq. (2.9) with respect to the infinitesimal change of link vari- 
ables as 



5S q = 5\P Npoly [-M 00 )<p o 

^poly , 

= £ (a l _ 1 T,„ 1 [-M 00 ]y(°)) T (-5M 00 )y« + h.c., 



(4.7) 



i=i 



where h.c. means the Hermitian conjugate of the preceding term, and sat- 
isfies Eq. (3.9) with v = <f> . We used the Hermiticity of at and P Npoly [—M 00 ], 

and = -P/v poly [— M 00 ]4> was applied in the last line. A more convenient 
form of 5S q for practical calculations is given by 



5S, 



J2 Oi-i X^UM Z®+h.c, 



q (aA ) 2 
where and M are defined as 



(4.8) 



SM- 



( -M Pn x^ \ 



x 



eo-^o 

(0 



M eo yf 



(0 



5M m 



6M oe 



(4.9) 



with 



yf = a t (-M 00 )y^ + A^ +2) + Q0 O (i = N poly , • • • , 0), 



xf = a^ 2 (-M 00 )xt 1] + Pi- 3 xt 2) {i = 2, • • • , AW), 



x 



(1) _ „(0) 



o fo 



(4.10) 
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In the actual calculation, we first calculate j/W from i = N po \ y to i = and store 

on memory. We then sum up each of the force contribution 5M 
evaluating from i — 1 to i — N po \ y . A similar method, which has a more 
complicated form, has been obtained for the HMC algorithm with the overlap 
fermions by C. Liu 28]. Since the even component of and appear as 
a byproduct of the multiplication of M OQ in the recurrence equation, we do 
not need extra calculations for the even components. The explicit form of the 
force contribution is obtained from these equations as usual. 

We have described the force calculation for the case A as an example. The 
force calculation for the case B is almost identical except for the replacement 
Npoiy -> Apoiy/2 and c* -> d { . 

Eqs. (4.8), (4.9), and (4.10) do not contain an iterative procedure. This leads 
us to expect that the reversibility violation of the MD evolution is smaller than 
in the usual HMC algorithm with four-flavor KS fermions in which an iterative 
solver such as the Conjugate Gradient (CG) method is used to invert the KS 
fermion operator. Our implementation, however, still involves the possibility 
that round-off errors grow to violate the reversibility in the summation of 
jJfW SM Z® from i — 1 to % — N po \ y . We investigate this issue in section 7 
on a realistic size lattice. 

The external field <p is generated at the beginning of every MD trajectory 
according to the pseudo-fermion action Eq. (2.9) (case A) (Eq. (2.17) (case B)) 
using the global heat-bath method. This is achieved by solving the following 
equations with respect to <p : 



PN poly [D 00 ]4> = X o (case A), (4.11) 
Q Npoly [Doo]^o = Xo (caseB), (4.12) 

where \o is a Gaussian noise vector with unit variance. Using the identity 



<Po = (PN poly [Doo])- 1 Xo 

= D 00 (P Npo jD 00 ]f /N f- 1) W[D 00 ]- 1 Xo (case A), (4.13) 

<Po = (QN polY [DooD^Xo 

= ^oo(g7v pol J^oo]) t (P7v pol J J Doo]) (4/Af/ - 1) ^[ J D 00 ]- 1 X o (case B), (4.14) 

we invert the coefficient matrix -P/v poly [D o] (case A) (QAf poly [-D o] (case B)) 
using the CG solver to W[Z? 00 ]. 
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5 Noisy Metropolis test 



In order to make algorithm exact, we have to take into account the correction 
term det[JV [.Doo]]^ 4 - This is achieved by a noisy Kennedy-Kuti |l7| Metropo- 
lis test. This method has been used in the multi-boson algorithm |l8| . In this 
section we explain the details of the noisy Metropolis test. 



5. 1 Definition 



When a trial configuration U' which is generated by the molecular dynamics 
step in the HMC part of the algorithm is accepted at the HMC Metropolis test, 
we make the noisy Metropolis test for the correction factor detfl^ [.Doo]] 7 ^ 4 - 
The acceptance probability of the trial configuration U' from an initial con- 
figuration U is defined by 



P C orr[U -> U') = min 



l,e 



-dS[U,U'] 



(5.1) 



with 



dS[U, U'\ 



W[D' 0( 



-Nf/8 



W[D 



ooj '/o 



\Vo\ 



(5.2) 



where n is a Gaussian random vector with unit variance. This probability 
satisfies the detailed balance relation 3|. The factor VK[-Dq ] is calculated on 



the trial configuration U', while VT[.D 00 ] on the initial configuration U. 
Eq. (5.2) can be modified to 



dS[u,u'} = Cw[D'j' f/ c-kl 2 , 

( = W[D 00 ] N f/ s r ]o . (5.3) 

We employ Eq. (5.3) for dS in the present work. It is not known at present 
which of the expressions, Eq. (5.2) or (5.3), is more useful for calculating dS 
in respect of computational efficiency and accuracy, which is left for future 
studies. 

In order to evaluate Eq. (5.3), we need to calculate the fractional power of the 
matrix W[D 00 ] (or W[D' 00 }). In Ref. we used a Taylor expansion method 
for the (inverse-) square root of the correction matrix with the 0(a)-improved 
Wilson fermions. In order to suppress the truncation error of the Taylor expan- 
sion we explicitly calculate and monitor the residual for the Taylor expansion 
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for which we need an extra computational cost. The direct application of this 
method to Eq. (5.2) requires more computational overhead compared to the 
Wilson case, since the residuals contains several application of W[D 00 ] N f' 8 and 
W{D 00 ]~ N f/ 4 , (e.g. for Nf = 2 case, 8 times, and 4 times to define the residuals, 
respectively). Instead of this method we employ the Krylov subspace method 
to avoid the explicit calculation of the residuals as described in the following. 

Here we roughly estimate the order of polynomial A" po i y required to achieve a 
given acceptance rate at a volume V and aA max . The acceptance rate Eq. (5.3) 
behaves as dS = [constant] x V x (— r) 7V p ol y +1 , where r is defined in Eq. (3.4). 
To keep dS < S with a small constant 5 that maintains sufficient acceptance 
rate, we need 

iVpoiy > (In V - In 6 + [constant]) , (5.4) 



where we used r ~ — 1 + 2(am)/(aA max ) and am <C «A max from the definition 
of r in Eq. (3.7). Therefore we need to increase N po \ y linearly as increasing 
the condition number (aA max )/(am), and logarithmically with volume V. A 
similar discussion on the required A" po i y can be found in Refs. j3] in the 
literature of the multi-boson algorithm. 



5.2 The Krylov subspace method 



Since the matrix is Hermitian and positive definite with the KS fermion, and 
already well preconditioned, we can take the fractional power with the Krylov 
subspace method with better efficiency HB A Lanczos-based Krylov sub- 
space method was developed by Borigi [20|, which was utilized for the cal- 
culation of inverse square root of squared Hermitian Wilson-Dirac operator 
in the Neuberger's overlap operator. The method is an application of the 
Krylov subspace method to calculate functions of large sparse matrices |29| . 
The Lanczos-based Krylov subspace method enables us to define a kind of 
residual without explicit residual calculation. We apply this method to evalu- 
ate the fractional power of the correction matrix. 

Consider a matrix function multiplied on a vector, f(A)b, with an n x n Her- 
mitian matrix A and an n component vector b in general. The fc-dimensional 
orthogonal basis Qk = (<li, 4 ■ ■ ,Qk), which spans the Krylov subspace with 
respect to the matrix A, is obtained by the Lanczos algorithm. This basis 
satisfies the following condition: 

AQk = QkT k + (3 k q k+1 (eP) T } q l = p x b, p x = l/\b\, (5.5) 
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where T k is a k x k tridiagonal matrix whose diagonal and sub-diagonal parts 
are a%, a 2 , • • • , a k and Pi, /3 2 , • • • , Pk-i, respectively. e!$ is the unit vector in 
the m-th direction in n-dimension. The superscript T means transpose. 

If j3k is sufficiently small after k iterations of the Lanczos method, the matrix 
function f(A) acting on the vector b is approximated by 

f(A)b~Q k f(T k )eP\b\. (5.6) 



In our case, A = W[D 00 ], f(x) = x~ N f /4 , and b = rj Q , or A = W' 00 , f(x) = 
x Nf ^ 8 , and b = ( . The dimension k is much smaller than that of A when A is 
close to unity. Thus, the calculation of f(A) is reduced to that of f(Tk) with 
smaller computational cost. 



Our algorithm is almost identical to that of Ref . [20( . We show the algorithm 
to calculate the matrix function f(A)b with the Lanczos-based method in Al- 
gorithm 1. We compute by diagonalizing Tj. with LAPACK subroutines. 
If the algorithm stops after k iterations, we have an approximate solution to 
the matrix function f(A) given by Xk ~ f{A)b. 

In our algorithm, a large cancellation error can occur in the Gram-Schmidt or- 
thogonalization step because our correction matrix is well conditioned VTf-Doo] ~ 
1. We therefore implement a full reorthogonalization step in our algorithm to 
keep the orthogonality among the Lanczos vectors 

Our algorithm requires k working vectors to store the Lanczos vectors. How- 
ever, we already employ iV P oiy working vectors to calculate the quark force in 
the HMC step, which we can reuse for the Lanczos vectors. A possible problem 
that the dimension of Krylov subspace, k, exceeds the number of working vec- 
tors, Ap i y , do not arise in practice for large A^ po i y because large A^ poly means 
that W is very close to unity so that the Lanczos algorithm stops at earlier 
steps. 

We can consider various types of stopping condition. For example, ern is 
based on the residual in CG algorithm for the calculation of A~ l b [30], and 
err 2 on a comparison of the successive approximation Xj as described in Algo- 
rithm 1. In order to avoid explicit residual calculation, these stopping criteria 
should ensure that the residual decreases to sufficient level during the itera- 
tion. For the CG-based stopping condition, which was originally introduced by 
Borigi 2£j, the analytical relation between the CG-based stopping condition 



and the truncation error of the Lanczos iteration is discussed by van den Eshof 
et al. j3l|, where it is proved that the CG-based stopping condition for the 
inverse square root in the overlap operator is a safe stopping condition. 

We cannot directly apply their proof to our case because the exponent of the 
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Algorithm 1. Algorithm for matrix function x = f(A)b. 

Pi = 1/|&|; Qi = bpi 
for % — 1, 2, • • • do 

r = Aqi 

if i = 1 then 
v — r 

else 

v = r - Pi-tfi-! 

end if 

oii = qjv 

v := v - aiqi 

for j — i, % — 1, • • • , 2, 1 do 

7 = gju 
f := v — •yqj 
end for 

A = M 

if i = 1 then 

Pi+i = -pm/Pi 

else 

Pi+l = ~ {Pi&i + Pi-lPi-l) /Pi 

end if 

err i = \l/p i+ i\ 
q i+1 = v/Pi 

set = «i, (T;) i+M = = Pi, otherwise (Tj)^ = 0. 

diagonalize = UikiUj ', where is the (i x i) tridiagonal matrix. 

for j — 1, • • • , i do 
. "I - (ti^jjqj 
end for 

(err 2 = - a^/l^-il) 

if err! < to/ (err 2 < tol) then 

exit 
end if 
end for 

solution x = Xi ~ f(A)b. 



correction matrix is not limited to —1/2. We employ the CG-based stopping 
condition erri in the noisy Metropolis test, and numerically verify the va- 
lidity of this choice by observing the convergence behavior of the residuals 
and dS. This analysis is described in section 7. We leave the mathematical 
proof whether the CG stopping condition is safe or not for our case for future 
studies. 
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6 Cost estimate 



The computational cost of our algorithm is measured by the number of mul- 
tiplication with M 00 for traversing a single unit of trajectory. The number 
of multiplication is separately estimated for the HMC part and the noisy 
Metropolis part. We define the algorithmic parameters as follows: 

• the number of MD step : N^ D 

• the number of iteration of the CG algorithm : N GG 

• the order of polynomial : N^ oly 

where the superscript A refers to the case A algorithm, which should be re- 
placed with B for the case B algorithm. We use the single leapfrog scheme for 
the MD integrator. 

Cost of HMC part The computational cost of the HMC part of the al- 
gorithm is divided into three pieces; calculation of the MD force with the 
polynomial pseudo-fermion, the generation of pseudo-fermion field according 
to the polynomial action, and the calculation of the total Hamiltonian for the 
HMC Metropolis test. 

From Eqs. (4.8), (4.9), and (4.10) in section 4, the cost of the force calculation 
in the HMC algorithm is estimated as 

< D x (2A£ ly - 1) (case A), 
Kb x « oly - 1) (caseB). 

From Eqs. (4.13) and (4.14), the computational cost to generate the pseudo- 
fermion field is estimated as 

((8/JV» x N^ oly + 1) x N&, + (8/N f - 1) x A^ Iy + 1 (case A), 



((4/iV» x N» ly + 1) x N* G + (4/iV> - 1) x N^ oly + ATf oly /2 + 1 

(case B). 

The computational cost of the Hamiltonian comes from the calculation of the 
pseudo-fermion action at the end of the MD step. The number of multiplication 
of M OQ is estimated as N^ oly for the case A and N^ oly /2 for the case B. 

Summarizing, the computational cost of the HMC part of our algorithm is 
given by 
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^hmc cost = (2< oIy - 1) x < D + ((8/N f ) x NX + 1) x N&, 

+ (8/N f -l)xNX + l, (6.1) 

for the case A, and 



Niuc cost = «oi y - 1) x N* D + ((4/JV» x JV*, + 1) x A<? G 

+(4/iV> - 1) x 7V p s oly + + 1, (6.2) 

for the case B. 

We observed in section 3 that N^ oly ~ ^poiy a ^ a comparable value of the 
polynomial approximation residual Eq. (3.12). Assuming A^ D ~ and 
N£ G ~ JV£ G , we find A^ MC cost ~ 2A^ MC cost . We conclude that the cost of 
HMC part of the algorithm is twice better for the case B than for the case A. 



Cost of noisy Metropolis part Here we estimate the cost of the noisy 
Metropolis test, N^ MP cost and iVjf MP cost , for both cases. The cost arises from 
Eq. (5.3), and is estimated as 

<mp cost = ((8/JV» x A p A oIy + 1) x N$ G x 2, (6.3) 



for the case A, and 

Cp cost = ((4/iV/) x N^ oly + 1) x N* G x 2, (6.4) 



for the case B. The factor 2 arises since we need to call twice the Lanczos- 
based algorithm to calculate W[D 00 \ N f/ 8 r) and W[D' 00 \ N f/\ . Here we used 
N GG (N GG ) as the number of Lanczos iteration. This is because the number 
of Lanczos iteration is expected to be almost identical to that of CG iteration 
to generate the pseudo-fermion field, when we employ the CG based stopping 
criterion. We expect N GG ~ N GG and Ap^ ly ~ N^ oly as before. Then we find 



Total cost Combining the result on the computational cost for the HMC 
step and that for the noisy Metropolis test described above, we find that the 
cost for the case A algorithm is larger than that for the case B algorithm by 
a factor two when the approximated polynomials for the two algorithms have 
the same residual on the correction matrix. In our numerical test described in 
the next section, we employ the case B algorithm. 
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7 Numerical tests 



We test our algorithm (case B) with the Chebyshev polynomial on three lat- 
tices in the two-flavor case. The simulation program is written in the optimized 
FORTRAN90 on SR8000 model Fl at KEK. Double precision arithmetic is 
applied to the whole numerical operations. The lattice volume, gauge coupling, 
and quark masses we employed are shown in Table 1. The "small size" lattice 
is used to investigate the basic property of the algorithm. We show the A po i y 
dependence of {dS) and the averaged acceptance rate of the noisy Metropolis 
test (P CO rr)- The iVpoiy dependence of averaged plaquette (P) is also presented 
on this lattice, and a comparison to results from the i?-algorithm is also shown. 
Using the "middle size" lattice, we compare the A po i y dependence of {dS) for 
different quark masses. We employ the "large size" lattice parameter to see 
the applicability of our algorithm to realistic simulations, where we check the 
reversibility of the MD evolution and the validity of the CG-based stopping 
criterion for the Lanczos algorithm in the noisy Metropolis test. A comparison 
of (P) to the /^-algorithm is also made. 

The unit trajectory length is chosen to be 1 in the following. We employ the 
single leapfrog scheme for the MD evolution, and call the number of MD step 
as NyiD- The parameter aA max is roughly optimized during the thermalization 
period for each lattice parameter. 



7. 1 Results on the small size lattice 

On the small size lattice, aA max is chosen to be 2.37. Figure 5 shows the A po i y 
dependence of (dS) for this lattice, where the number of MD step is Amd = 25. 
The dotted line shows a two-parameter fit to aexp(— bN po \ y ). We observe a 
clear exponential decay as expected. 

Figure 6 shows the A poly dependence of the averaged noisy Metropolis ac- 
ceptance rate (P CO rr)- We observe consistent results to the theoretical ansatz 

1 /2 

erfc((aexp(— 6A poly )) ' /2), where the dotted curve shows the ansatz with a 
and b obtained in Figure 5. 

Table 1 

Simulation parameter 





volume [3 am 


Small size 
Middle size 
Large size 


8 3 x 4 5.26 0.025 
8 3 x 16 5.70 0.01 and 0.02 
16 4 5.70 0.02 
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Fig. 5. iVp i y dependence of (dS) on the small size lattice. 
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Fig. 6. iVp iy dependence of noisy Metropolis test acceptance rate on the small size 
lattice. 

We show the iVp i y dependence of the averaged plaquette (P) in Figure 7. The 
horizontal line shows the constant fit to the results. The fit error is presented 
by the dashed lines. The results does not depend on iVp i y as it should be. In 
Figure 8, we plot the MD step size, dt = 1/Nmd, dependence of the averaged 
plaquette together with the results with the i?-algorithm. We employ N po i y = 
200 for the PHMC algorithm in the figure. Since the /^-algorithm has 0(dt 2 ) 
errors, we fit the results with the i?-algorithm with f(dt) = adt 2 + bdt 3 + 
c as shown by the dotted curve. The horizontal lines show the constant fit 
to the PHMC results again. The result with the PHMC algorithm does not 
depend on dt and is consistent with that of the zero step size limit of the 
i?-algorithm. With these observations we conclude that the PHMC algorithm 
works perfectly on the small size lattice. 
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Fig. 7. iVp iy dependence of the averaged plaquette on small size lattice. 
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dt 



Fig. 8. MD step size dt dependence of the averaged plaquette on the small size 
lattice. The dotted curve shows the fit with f{dt) = adt 2 + bdt 3 + c for the results 
with i?-algorithm. 



7.2 Results on the middle size lattice 



We plot the iV po i y dependence of (dS) for am = 0.01 and 0.02 in Figs. 9 and 
10, respectively. Here Nmt> = 50 and aA max = 2.28 are used for both masses. 
A clear exponential decay is observed in both figures. These behaviors are 
similar to those seen for the small size lattice. 

For the am = 0.01 case, which corresponds to the ratio of pseudo-scalar and 
vector meson masses m PS /my ~ 0.61 32], we need N poly ~ 500-600 for rea- 
sonable acceptance rate for the noisy Metropolis test. Moreover, assuming 
aA max /am <^ 1 and independence of aA max from am, we expect that A" poly 
behaves as aA max / am in order to keep the residual at a constant level (see be- 
low Eq. (3.6)). This leads us to suspect that for simulations with much lighter 
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Fig. 9. Same as Figure 5, but for the middle size lattice with am = 0.01. 
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Fig. 10. Same as Figure 5, but for the middle size lattice with am = 0.02. 
quark masses a polynomial of order 0(1000) is required. 

Our PHMC algorithm has several problems with such a large order polynomial. 
One is the memory cost in the calculation of the MD force from the polynomial 
pseudo-fermion. Fortunately this would not be a serious hindrance in nowadays 
high performance computing since memory cost is relatively low compared to 
the Wilson fermions (the KS fermions do not have spin indices). Another 
problem is the extraction of the polynomial coefficients of Qn po17 from the 
original polynomial -Pjv o1 as described in section 2. 



7.3 Results on the large size lattice 



We show the results on the large size lattice in Table 2. We quote the averaged 
plaquette value with the .R-algorithm from Ref . |32| • We observe a roughly 2a 
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deviation, which may be ascribed to a finite step size error inherent in the 
.R-algorithm. The number of multiplication of hopping matrices M eo and M oe 
measured in the program is roughly proportional to N po \ y , which is expected 
from the discussion in section 6. We will discuss the finite step size error of 
the .R-algorithm and compare the computational cost of the PHMC algorithm 
to that of the .R-algorithm after describing the numerical property of the re- 
versibility violation, the Lanczos algorithm, and dS for the PHMC algorithm. 

As described in section 4, we investigate the reversibility violation on the large 
size lattice using the following observables: 



AH/H=\H{t r - tr) - H(0)\/H(0), 



AU =JT, \(Up)a,b(n)(t r ~ U) ~ (^)^H(0)|, 
y n,fi,a,b 

AP = / £ \WaM(tr ~ t r ) ~ (P,) a ,b^Ml 
y n,fi,a,b 



(7.1) 
(7.2) 

(7.3) 



where X(0) means the observable X calculated at the initial configuration 
at t = in the MD evolution and X(t r — t r ) the observable calculated at 
the reversed configuration which is obtained from the initial configuration at 
t = by integrating the MD equation to t = t r and then integrating back to 

Table 2 

Numerical results with PHMC (am = 0.02, aA max = 2.28) on the large size lattice. 
(P) : averaged plaquette. ^Mult/traj : averaged number of multiplication of M eo 
and M oe to achieve unit trajectory. 



Npoly 


300 


400 


500 


R algorithm [32] 


[dt, N MD ] 


[0.02,50] 


[0.02,50] 


[0.02, 50] 


[0.02,50] 


Trajectories 


1700 


1050 


800 


300 


#Mult/traj 


61291(183) 


73176(296) 


87955(350) 




(P) 


0.577099(46) 


0.577130(46) 


0.577023(43) 


0.577261(49) 


HMC 

Acceptance 


0.8059(103) 


0.7962(168) 


0.7775(194) 




(dH) 


0.1112(126) 


0.1359(147) 


0.1497(187) 




Correction 
Acceptance 


0.7837(128) 


0.9627(70) 


0.9871(45) 




(dS) 


0.1331(164) 


-0.0002(29) 


0.0000(6) 




Total 
Acceptance 


0.6329(122) 


0.7657(168) 


0.7675(191) 
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Fig. 11. The convergence behavior of the Lanczos-based algorithm for dS calculation 
as a function of the tolerance level. Upper figure shows the number of iteration in 
the Lanczos algorithm, middle one shows the residuals defined in Eqs. (7.5) and 

(7.6) for (W[D 00 ]) N f/ 8 and {W[D' 00 \y Nf/4 , bottom one for the \dS - dS end \ where 
dS en d is the value of dS itself at tol = 10 -14 . 

t — 0. The trajectory length is t r = 1. We measured these quantities using 20 
configurations which are separated by 5 trajectories. We observe 

(AH/H) =0.26(6) x 10~ 15 , 
{AP}/^9 x 4 x iV vol = 0.4162(7) x 10" 14 , 

(AU)/fi x 4 x N vol = 0.1484(2) x 10~ 14 , (7.4) 

with the iVpoiy = 300 PHMC algorithm. The errors are estimated with the 
binned jackknife method. These values are at a completely satisfactory level 
with the double precision arithmetic. Consequently it is concluded that the 
method we employed for the force calculation of the polynomial pseudo- 
fermion is stable in the MD evolution and does not cause violation of re- 
versibility. 

The validity of the CG based stopping criterion for the Lanczos method to 
calculate dS in the noisy Metropolis test is also investigated on the same 20 
configurations. We measured dS, and two residuals denned by 
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ri 
r 2 



W[D 00 ]ff/ s Y /Nf Vo -W[D 00 } Vo 
W[D' 00 ) ((W[D' 00 \r Nf/4 y /Nf Co-C 



/\W[D 00 } Vo \, (7.5) 

/\a (7.6) 



where ( Q is defined in Eq. (5.3), and the number of iteration of the Lanczos 
iteration by varying the tolerance level tol for err\. 

Figure 11 shows the convergence behavior of the above quantities. The number 
of iteration increases step by step with the decreasing stopping condition (top 
of the figure). The residuals Eqs. (7.5) and (7.6) stagnate around a O(10~ 12 ) 
level (middle of the figure). As the residuals are defined as relative error, one 
may suspect that the number of O(10~ 12 ) is not sufficient with the double 
precision arithmetic. We think this is due to the accumulation of round-off 
errors in applying the Lanczos iteration several times to calculate the residuals. 
Namely, for the explicit residual calculation, we need four times application 
of the Lanczos iteration for r\ and twice for r 2 in the Nf = 2 case, and the 
Lanczos iteration does not span the completely same Krylov subspace in each 
application. 

For the correctness of the algorithm, dS itself is more important. To see the 
stopping condition dependence of dS, we measured the convergence of \dS — 
dS en d\ as the metric and show the result in the bottom of Figure 11, where 
dS en< i is dS at the most stringent stopping condition tol = 10~ 14 . Since the 
change is too rapid against the change of the number of iteration, we could not 
see the exponential decay. The metric stagnates around O(10~ 10 ). The reason 
is understood as follows. dS is defined as the difference of ClW[D' 0C \~ N fl 4: ( !0 and 
| ?7 1 2 in Eq. (5.3). Numerically it is observed that (lW[D' 00 ]~ N f ' 4 £ and \i] \ 2 . 
have almost the same values of about 3xl6 4 /2 ~ 0(1 5 ), and the resulting dS 
is of O(10 _1 ). Within double precision arithmetic, the subtraction of O(10 5 ) 
from O(10 5 ) yielding O(10 _1 ) for dS means that dS only has 9-10 significant 
figures. The stagnation around O(10 -10 ) would occur in such a case. We stress 
that the 9-10 significant figures for dS is sufficient in the realistic simulations 
with O(10 4 ) trajectories. No visible effect from the choice of the CG-based 
stopping criterion is observed. 



The plaquette value of the i?-algorithm from Ref. |32j is larger than that of 
the PHMC algorithm about 2a as shown in Table 2. This may contradict the 
previous observation on the small size lattice where the plaquette value of the 
/^-algorithm approaches the value of the PHMC algorithm from smaller value. 
To make clear the dt dependence of the plaquette with the /^-algorithm, we 
imported the program of the /^-algorithm into SR8000-F1 model and produced 
several trajectories with the i?-algorithm on the large size lattice. 

Table 3 shows the results with the i?-algorithm on the large size lattice. The 
definition of the norm res for the stopping criterion of the CG algorithm 
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required in the i?-algorithm is the same as that of Ref. |32| (where the symbol 
r is used instead of res). We do not observe clear stopping criterion dependence 
on (P). Figure 12 shows the dt dependence of (P) on the large size lattice. 
Open circles are the results with the i?-algorithm produced for the comparison. 
Filled square is the previous result with the i?-algorithm [i^. Open triangles 
are the results with the PHMC algorithm scatted around dt = 0.02 for clarity 
of presentation. We observe that the dt behavior of the /^-algorithm in small 
dt region is slightly different from that of the small size lattice (Fig. 8), while 
the behavior at large dt region is similar to each other. We also observe that 
the value of dt where (P) largely deviate from the value at the limit dt ^ 
is different (it is dt ~ 0.01 in Fig. 8 and dt ~ 0.04 - 0.05 in Fig. 12). 

Although we cannot make clear statement on the discrepancy of dt behavior 
between two lattice sizes, we can say that the dt dependence is affected by 
the physical situation and parameters. The reason is as follows. The error 
analysis of the .R-algorithm by dt perturbation fails at large dt. More precisely 
it is said that the point where the perturbative analysis fails is governed by 
dt/m with m the lightest fermion mode in the R- algorithm [33]. We do not 
tune the input parameters for these two simulation sets. It is natural that 
the physical lightest fermion mode is different between the small and large 

Table 3 



The results with the /^-algorithm on the large size lattice (16 4 ,/3 = 5.7, am = 0.02). 
res is used for the stopping criterion of the CG algorithm in the i?-algorithm. 



res 


10" 4 


10" 4 


10" 4 


10" 4 


[dt, N M d] 


[0.005,200] 


[0.01,100] 


[0.02,50] 


[0.025,40] 


# of Traj. 


800 


1200 


1200 


1000 


#Mult/traj 


113210(450) 


56700(355) 


28627(113) 


23102(76) 


(P) 


0.577102(63) 


0.577133(67) 


0.577194(71) 


0.577294(67) 


res 


10" 4 


10" 4 


10~ 4 


io- 8 


[dt, N MD ] 


[0.04,25] 


[0.05,20] 


[0.0625,16] 


[0.02,50] 


# of Traj. 


1500 


2000 


2500 


1000 


#Mult/traj 


14432(130) 


11808(42) 


9478(91) 


77880(807) 


(P) 


0.577389(71) 


0.577076(100) 


0.576423(232) 


0.577411(67) 


res 


lO" 12 


lO" 15 






[dt, N MD ] 


[0.02,50] 


[0.02,50] 






# of Traj. 


1000 


800 






#Mult/traj 


127772(1162) 


166573(424) 






(P) 


0.577335(63) 


0.577242(57) 
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Fig. 12. MD step size dependence of the averaged plaquette on the large size 
lattice. The dotted curve shows the fit with f(dt) = adt 2 + bdt 3 + c to open circles 
with the /^-algorithm. Filled square is the result from Ref. |32T |. 



size lattices. Thus we consider that this discrepancy comes from the different 
physical parameter and situation between the small and large size lattices and 
(P) with the /^-algorithm at dt = 0.02 is accidentally larger than that with 
the PHMC algorithm in the large size lattice. We stress that the result from 
the PHMC algorithm is consistent with the limit dt —*■ of the results with 
the i?-algorithm. 



The computational cost of the PHMC algorithm of N po \ y = 400 is comparable 
to that of the i?-algorithm with res = 10~ 8 at dt = 0.02 in terms of #Mult/traj 
on the large size lattice. We need to take the limit dt — > and use sufficiently 
small res in the i?-algorithm theoretically, which requires significantly large 
amount of computational cost for the i?-algorithm. The actual computational 
time for the PHMC algorithm with iVp iy = 400 on the large size lattice 
was measured as 136 sec for unit trajectory with SR8000-F1 4-nodes (peak 
speed : 12x4 GFlops, sustained speed : 3.5 x 4 GFlops) and it is 120 sec 
for the i?-algorithm with res = 10~ 8 (sustained speed : 3.4x 4 GFlops) at 
dt = 0.02. The same computational speed is achieved because both programs 
make use of the common subroutine for the hopping matrix multiplication. 
Thus we conclude that the PHMC algorithm works on the lattice size of 16 4 
with a quark mass am = 0.02 corresponding to mpg/my ~ 0.69 within 
reasonable computational time and is applicable to realistic simulations. The 
advantage of exact algorithm is very clear in this situation. 
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8 Conclusion 



In this paper, we have presented an exact algorithm for dynamical lattice 
QCD simulation with the KS fermions where single flavor quark is defined as 
the 1/4 power of the KS fermion matrix. The algorithm is an extension of the 
polynomial Hybrid Monte Carlo (PHMC) algorithm in which the Hermitian 
polynomial approximation is applied to the fractional power of the KS fermion 
matrix. The Kennedy-Kuti noisy Metropolis test is incorporated to make the 
algorithm exact. 

We introduced two types of polynomial approximations and corresponding 
PHMC algorithms. One algorithm approximates Do^ 1 ^ with a single poly- 
nomial Pn po1y [D 00 ] (case A), and the other Doo 1 ^ with a squared polyno- 
mial |Q7v poly [-Doo]| 2 (case B). For the noisy Metropolis test, we made use of a 
Lanczos-based Krylov subspace method to calculate the fractional power of 
the correction matrix. The efficiency of the two algorithms was estimated, and 
it was found that the latter (case B) had better performance than that of the 
former by a factor two. 

We tested our algorithm (case B) using the Hermitian Chebyshev polynomial 
for the case of two-flavor QCD on three lattice sizes. Results on a small lattice 
of 8 3 x 4 demonstrated that the algorithm works correctly, e.g., the averaged 
plaquette value agrees with that of the _R-algorithm after extrapolation to 
the zero step size limit. We have also shown that our algorithm works on a 
moderately large lattice of size 16 4 , albeit for a rather heavy quark mass of 
mvs/mv ~ 0.69, within reasonable simulation costs compared to that of the 
i?-algorithm. 

There are several points that require improvements with our work. One of 
the points concerns the fact that the calculation of the polynomial coefficients 
in case B for splitting the original polynomial becomes progressively difficult 
toward lighter quark masses. While this is not a limitation of the PHMC 
algorithm itself, solutions should be found to solve this problem for future 
realistic simulations since the case A algorithm, which has no such problem, is 
expected to be twice slower than than the case B algorithm. Another point is 
that further improvement of the algorithm may be achieved by optimizing the 
polynomial approximation, and by combining the preconditioning technique 
and the polynomial approximation. 

Anticipating progress on these fronts, we conclude that our algorithm provides 
an attractive method for dynamical KS fermion simulations for 2 + 1-flavor 
QCD without systematic errors originating from the simulation algorithm. 
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